- CNN
- Encoder-Decoder
- Attention
- Transformers
Main CNN idea for text:
Compute vectors for n-grams and group them afterwards
Example: for “this takes too long”, compute vectors for:
“this takes”, “takes too”, “too long”, “this takes too”, “takes too long”, “this takes too long”
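A minimal numpy sketch of this idea (the name `ngram_vectors` and the embedding/filter sizes are illustrative, not from the original): each n-gram vector is a learned linear map of the concatenated word embeddings in a sliding window, and the resulting vectors are then grouped by max pooling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy word embeddings for "this takes too long" (4 words, dim 5).
embeddings = rng.standard_normal((4, 5))

def ngram_vectors(emb, n, W):
    """Slide a window of n words; each window's concatenated
    embeddings times a filter matrix W gives one n-gram vector."""
    out = []
    for i in range(emb.shape[0] - n + 1):
        window = emb[i:i + n].reshape(-1)  # concatenate n embeddings
        out.append(W @ window)
    return np.stack(out)

# One filter producing 3-dimensional bigram vectors.
W_bigram = rng.standard_normal((3, 2 * 5))
bigrams = ngram_vectors(embeddings, 2, W_bigram)  # "this takes", "takes too", "too long"
print(bigrams.shape)  # (3, 3)

# "Grouping afterwards": max pooling over the n-gram vectors.
pooled = bigrams.max(axis=0)
print(pooled.shape)   # (3,)
```

The same function with n = 3 or n = 4 produces the trigram and 4-gram vectors listed above.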
The Sequential model is used to build a linear stack of layers:

```python
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import SGD
```
Note:
- Dense is the fully connected layer;
- Flatten is used after all CNN layers and before the fully connected layer;
- Conv2D is the 2D convolution layer;
- MaxPooling2D is the 2D max pooling layer;
- SGD is the stochastic gradient descent algorithm.
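To make the layer semantics concrete, here is a small numpy sketch of what Conv2D (a valid cross-correlation) and MaxPooling2D compute for a single channel; the function names and the toy input are illustrative, not the Keras implementation:

```python
import numpy as np

def conv2d(x, kernel):
    """Valid 2D convolution (really cross-correlation, as in Conv2D)."""
    H, W = x.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(x, size=2):
    """Non-overlapping max pooling, as in MaxPooling2D."""
    H, W = x.shape
    return x[:H - H % size, :W - W % size] \
        .reshape(H // size, size, W // size, size).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)   # toy 4x4 "image"
k = np.ones((3, 3))                 # 3x3 filter
feat = conv2d(x, k)                 # shape (2, 2)
pooled = max_pool2d(feat, 2)        # shape (1, 1)
```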
Most significant change: a new set of weights, U, connects the hidden layer from the previous time step to the current hidden layer. These weights determine how the network should make use of past context in calculating the output for the current input.
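A small numpy sketch of one recurrent step, assuming the common simple-RNN formulation \(h_t = \tanh(Wx_t + Uh_{t-1} + b)\); the dimensions and random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 3

# W maps the current input; U (the new set of weights) maps the
# previous hidden state into the current hidden state.
W = rng.standard_normal((d_h, d_in))
U = rng.standard_normal((d_h, d_h))
b = np.zeros(d_h)

def rnn_step(x_t, h_prev):
    return np.tanh(W @ x_t + U @ h_prev + b)

h = np.zeros(d_h)                             # initial hidden state
for x_t in rng.standard_normal((5, d_in)):    # 5 time steps of input
    h = rnn_step(x_t, h)                      # past context flows through U
print(h.shape)  # (3,)
```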
Abstracting away from these choices
Widely used encoder design: stacked Bi-LSTMs
- Contextualized representations for each time step: hidden states from the top layers of the forward and backward passes
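A rough numpy sketch of the bidirectional idea, using a simple tanh RNN as a stand-in for each LSTM direction (an assumption made for brevity): run the sequence left-to-right and right-to-left, then concatenate the two hidden states at each time step.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_in, d_h = 6, 4, 3            # sequence length, input dim, hidden dim
X = rng.standard_normal((n, d_in))

def run_rnn(X, W, U):
    """Simple tanh RNN as a stand-in for one LSTM direction."""
    h, out = np.zeros(d_h), []
    for x_t in X:
        h = np.tanh(W @ x_t + U @ h)
        out.append(h)
    return np.stack(out)

Wf, Uf = rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h))
Wb, Ub = rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h))

fwd = run_rnn(X, Wf, Uf)              # left-to-right pass
bwd = run_rnn(X[::-1], Wb, Ub)[::-1]  # right-to-left pass, re-aligned
context = np.concatenate([fwd, bwd], axis=1)  # one vector per time step
print(context.shape)  # (6, 6)
```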
Context vector \(c\): function of \(h_{1:n}\) and conveys the essence of the input to the decoder.
Can the context vector be more flexible?
- Different for each \(h_i\)
- Flexibly combining the \(h_j\)
Ideas:
- \(c_i\) should be a linear combination of those states: \[c_i = \sum_j{\alpha_{ij}h^e_j}\]
- What should \(\alpha_{ij}\) depend on?
Compute a vector of scores that capture the relevance of each encoder hidden state to the decoder state \(h_{i-1}^d\) \[score(h_{i-1}^d, h_j^e)\]
Just the similarity \[score(h_{i-1}^d, h_j^e) = h_{i-1}^d \cdot h_j^e\]
Give the network the ability to learn which aspects of similarity between the decoder and encoder states are important to the current application:
\[score(h_{i-1}^d, h_j^e) = h_{i-1}^d W_S h_j^e\]
\[
\begin{align}
\alpha_{ij} &= \text{softmax}(score(h_{i-1}^d, h_j^e))\quad \forall j \in e \\
&= \frac{\exp(score(h_{i-1}^d, h_j^e))}{\sum_k{\exp(score(h_{i-1}^d, h_k^e))}}
\end{align}
\]
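Putting the pieces together, a numpy sketch of both score functions and the resulting context vector (the dimensions and the random \(W_S\) are illustrative; in practice \(W_S\) is learned):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
h_enc = rng.standard_normal((5, d))   # encoder hidden states h^e_1..h^e_5
h_dec = rng.standard_normal(d)        # previous decoder state h^d_{i-1}

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Dot-product score: score(h^d, h^e_j) = h^d . h^e_j
scores_dot = h_enc @ h_dec

# Learned bilinear score: score(h^d, h^e_j) = h^d W_S h^e_j
W_S = rng.standard_normal((d, d))
scores_bilinear = h_enc @ W_S.T @ h_dec

alpha = softmax(scores_dot)           # attention weights alpha_{ij}
c = alpha @ h_enc                     # context vector c_i = sum_j alpha_{ij} h^e_j
print(round(alpha.sum(), 6))  # 1.0
print(c.shape)                # (4,)
```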
Just an introduction: These are two valuable resources to learn more details on the architecture and implementation
https://jalammar.github.io/illustrated-transformer/ (slides come from this source)
Key property of the Transformer: the word in each position flows through its own path in the encoder.
- There are dependencies between these paths in the self-attention layer.
- The feed-forward layer does not have those dependencies, so the various paths can be executed in parallel!
While processing each word, self-attention allows the model to look at other positions in the input sequence for clues that help build a better encoding for this word.
Step 1: create three vectors from each of the encoder’s input vectors: a Query, a Key, and a Value vector (typically of smaller dimension), by multiplying the embedding by three matrices that are trained during the training process.
Step 2: calculate a score (like we have seen for regular attention!) that determines how much focus to place on other parts of the input sentence as we encode a word at a certain position. Take the dot product of the query vector with the key vector of the respective word we’re scoring.
E.g., when processing self-attention for the word “Thinking” in position #1, the first score would be the dot product of q1 and k1; the second score would be the dot product of q1 and k2.
(Steps 3–5: divide the scores by the square root of the key dimension, pass them through a softmax, then multiply each value vector by its softmax score.)
Intuition: the softmax score determines how much each word will be expressed at this position.
Step 6: sum up the weighted value vectors. This produces the output of the self-attention layer at this position.
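The steps above can be sketched in numpy as follows; the scaling by \(\sqrt{d_k}\) follows the standard formulation, the matrix sizes are illustrative, and the projection matrices are random here rather than trained:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_model, d_k = 2, 8, 4            # e.g. two words, like "Thinking Machines"
X = rng.standard_normal((n, d_model))  # input embeddings, one row per word

# Step 1: the three trained matrices (random here for illustration).
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V  # queries, keys, values

# Step 2: dot products of each query with every key, scaled by sqrt(d_k).
scores = Q @ K.T / np.sqrt(d_k)

# Softmax over each row: how much each position attends to every word.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# Step 6: sum up the weighted value vectors.
Z = weights @ V
print(Z.shape)  # (2, 4)
```

Each row of Z is the self-attention output for one input position, and every row was computed from the same matrices, which is why all positions can run in parallel.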
More details:
- What we have seen for one word is done for all words at once (using matrices)
- The position of words needs to be encoded
- The mechanism is improved using “multi-headed” attention
(kind of like multiple filters for CNN)
see https://jalammar.github.io/illustrated-transformer/